@CTTY commented Oct 10, 2025

Which issue does this PR close?

What changes are included in this PR?

New:

  • Added a new partitioning module with the PartitioningWriter trait
  • ClusteredDataWriter: optimized for pre-sorted data; requires input written in partition order
  • FanoutDataWriter: flexible writer that can handle data from any partition at any time

Modification:

  • (BREAKING) Modified DataFileWriterBuilder to support dynamic partition assignment
  • Updated DataFusion integration to use the new writer API

Are these changes tested?

Added unit tests

/// Build the iceberg writer.
async fn build(self) -> Result<Self::R>;
/// Build the iceberg writer for an optional partition key.
async fn build_with_partition(self, partition_key: Option<PartitionKey>) -> Result<Self::R>;
@CTTY Oct 10, 2025


This is a breaking change. I believe this is necessary because:

  1. IcebergWriter is supposed to generate DataFiles that always hold a partition value, per the Iceberg spec.

  2. The existing code stores the partition value in the builder directly, making builder.clone() useless:

let builder = IcebergWriterBuilder::new(partition_A);
let writer_A = builder.clone().build();
... // write to partition A

// done with partition A; now we need to write to partition B
// this is wrong because partition value A is still stored in the builder
let writer_B = builder.clone().build();

An alternative is to add a new method, clone_with_partition(), but that would also be a breaking change, and it's less clean than build_with_partition().


/// A writer that writes data to a single partition at a time.
#[derive(Clone)]
pub struct ClusteredDataWriter<B: IcebergWriterBuilder> {
@CTTY Oct 10, 2025


ClusteredDataWriter and FanoutDataWriter are supposed to only work with DefaultInput and DefaultOutput.

I tried including generic I/O a few weeks ago and remember there were tons of tricky nuances, so I decided to just go with the default I/O types for now. For other I/O types (e.g. PositionalDeleteInput), we will need to add another implementation later.

Also, maybe we should name these DefaultClusteredWriter/DefaultFanoutWriter to avoid confusion, since they can also write equality deletes?


Development

Successfully merging this pull request may close these issues:

Implement fanout partitioned data writer.
